Data has been pulled from the original database. The samples employeeHistory_sample.csv and companies_sample.csv are informed by deals_sample.csv, where they only inclucde the co_id's within deal_sample.csv.
deals_sample.csv has the following constraint...
...to minimize the size of each CSV to be under 100 MB. The data included in the CSV files represent approximately 30-50% of the data from the original database.
Simple analysis is included in each section which are msot relevant to the Exploratory Analysis. Minimal transformations are exercised.
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Load in Data from CSV
companies = pd.read_csv('data/companies_sample.csv',encoding='utf-8').convert_dtypes()
deals = pd.read_csv('data/deals_sample.csv', encoding='utf-8').convert_dtypes()
emp_hist = pd.read_csv('data/employeeHistory_sample.csv', encoding='utf-8').convert_dtypes()
companies.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 137827 entries, 0 to 137826 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 co_id 137827 non-null string 1 co_name 137827 non-null string 2 co_hq 137251 non-null string 3 co_city 137249 non-null string 4 co_state 137251 non-null string 5 co_year_founded 121593 non-null Int64 6 co_revenue 35463 non-null Float64 7 co_business_status 131423 non-null string 8 co_financing_status 137827 non-null string 9 co_ownership_status 137827 non-null string 10 co_primary_industry 137554 non-null string 11 co_primary_industry_group 137554 non-null string 12 co_primary_industry_sector 137554 non-null string 13 co_industries 137827 non-null string 14 co_verticals 137827 non-null string 15 co_keywords 137827 non-null string 16 co_description 137664 non-null string dtypes: Float64(1), Int64(1), string(15) memory usage: 18.1 MB
companies.head()
| co_id | co_name | co_hq | co_city | co_state | co_year_founded | co_revenue | co_business_status | co_financing_status | co_ownership_status | co_primary_industry | co_primary_industry_group | co_primary_industry_sector | co_industries | co_verticals | co_keywords | co_description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002-34 | Acton Marketing | Lincoln, NE | Lincoln | NE | 1982 | <NA> | Generating Revenue | Corporate Backed or Acquired | Acquired/Merged | Media and Information Services (B2B) | Commercial Services | Business Products and Services (B2B) | {"Media and Information Services (B2B)","Other... | {} | {"crosses selling","directing marketing","fina... | Provider of direct marketing services. The com... |
| 1 | 100003-15 | Premama | Providence, RI | Providence | RI | 2011 | 5.5 | Generating Revenue | Venture Capital-Backed | Privately Held (backing) | Specialty Retail | Retail | Consumer Products and Services (B2C) | {"Other Pharmaceuticals and Biotechnology",Pha... | {HealthTech,"LOHAS & Wellness"} | {"fertility care","maternity food","postnatal ... | Online retailer of medicine supplements intend... |
| 2 | 100004-68 | Northeast Agencies | Williamsville, NY | Williamsville | NY | 1984 | <NA> | Generating Revenue | Corporate Backed or Acquired | Acquired/Merged | Insurance Brokers | Insurance | Financial Services | {"Insurance Brokers"} | {} | {"insurance agent","insurance policy","insuran... | Wholesaler of insurance policies. The company ... |
| 3 | 100004-86 | Portamedic | Bernards, NJ | Bernards | NJ | <NA> | <NA> | Generating Revenue | Corporate Backed or Acquired | Acquired/Merged | Other Healthcare | Other Healthcare | Healthcare | {"Other Healthcare","Other Healthcare Services"} | {} | {"clinical research","health information servi... | Provider of health information services. The c... |
| 4 | 100005-67 | BCV Social | Chicago, IL | Chicago | IL | 2009 | 10.9 | Generating Revenue | Formerly VC-backed | Acquired/Merged (Operating Subsidiary) | Media and Information Services (B2B) | Commercial Services | Business Products and Services (B2B) | {"Media and Information Services (B2B)"} | {"Marketing Tech",TMT} | {"hospitality industry","manage crises","monit... | Provider of social media in the hospitality in... |
px.histogram(
companies[['co_year_founded','co_primary_industry_sector']].dropna(),
x='co_year_founded',
color='co_primary_industry_sector',
histnorm='percent'
).update_layout(
xaxis={'title_text':'year'},
yaxis={'title_text':'%'},
title_text='% Companies per Years Founded',
legend_title='primary industry sector'
)
px.box(
companies[['co_revenue','co_primary_industry_sector']].dropna(),
x='co_revenue',
y='co_primary_industry_sector',
log_x=True
).update_layout(
xaxis={'title_text':'revenue (log)'},
yaxis = {'title_text':'primary industry sector'},
title_text='Revenue per Primary Industry Sector<br><sup>Revenue is measured in millions'
)
deals.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 235895 entries, 0 to 235894 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 deal_id 235895 non-null string 1 co_id 235895 non-null string 2 co_name 235895 non-null string 3 co_hq 235895 non-null string 4 co_city 235893 non-null string 5 co_state 235895 non-null string 6 co_verticals 235895 non-null string 7 deal_date 235895 non-null string 8 deal_vintage 235895 non-null Int64 9 deal_number 235895 non-null Int64 10 deal_series 36216 non-null string 11 deal_vc_round 55037 non-null string 12 deal_type 235895 non-null string 13 deal_premoney 52794 non-null Float64 14 deal_postmoney 85425 non-null Float64 15 deal_size 235895 non-null Float64 16 deal_pct_acq 82048 non-null Float64 17 deal_investor_count 86224 non-null Int64 18 investor_id 235895 non-null string 19 investor_name 235895 non-null string 20 investor_gp_id 235895 non-null string 21 investor_gp 235895 non-null string 22 fund_id 235895 non-null string 23 fund_name 235895 non-null string dtypes: Float64(4), Int64(3), string(17) memory usage: 44.8 MB
deals.head()
| deal_id | co_id | co_name | co_hq | co_city | co_state | co_verticals | deal_date | deal_vintage | deal_number | ... | deal_postmoney | deal_size | deal_pct_acq | deal_investor_count | investor_id | investor_name | investor_gp_id | investor_gp | fund_id | fund_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 175465-27T | 434316-61 | Renavotio Infratech (PINX: RIII) | Tulsa, OK | Tulsa | OK | {} | 2020-12-31 | 2020 | 2 | ... | <NA> | 8.33 | <NA> | <NA> | {} | {} | {} | {} | {} | {} |
| 1 | 169689-07T | 53951-32 | ASG Technologies | Naples, FL | Naples | FL | {TMT} | 2020-12-31 | 2020 | 11 | ... | <NA> | 0.76 | <NA> | <NA> | {} | {} | {} | {} | {} | {} |
| 2 | 160795-09T | 181488-43 | Origin (Industrial Supplies and Parts) | San Francisco, CA | SanFrancisco | CA | {"3D Printing","Advanced Manufacturing",TMT} | 2020-12-31 | 2020 | 5 | ... | 97.11 | 97.11 | 100.0 | 1 | {54544-15} | {"Stratasys (NAS: SSYS)"} | {"54544-15-Yoav Zeif"} | {"Yoav Zeif"} | {} | {} |
| 3 | 169319-17T | 266103-37 | Forte (Software Development Applications) | San Francisco, CA | SanFrancisco | CA | {Cryptocurrency/Blockchain,Gaming} | 2020-12-31 | 2020 | 3 | ... | <NA> | 2.56 | <NA> | 1 | {439103-62} | {"TriplePoint Private Venture Credit BDC"} | {439103-62-None} | {NULL} | {} | {} |
| 4 | 172449-01T | 466030-45 | Kado | New York, NY | NewYork | NY | {Industrials,Mobile,SaaS,TMT} | 2020-12-31 | 2020 | 1 | ... | <NA> | 0.6 | <NA> | <NA> | {} | {} | {} | {} | {} | {} |
5 rows × 24 columns
px.bar(
deals.merge(companies[['co_id','co_primary_industry_sector']], on='co_id').groupby(['deal_vintage','co_primary_industry_sector'])['co_id'].nunique() \
.rename('n_deals').reset_index(),
x='deal_vintage',
y='n_deals',
color='co_primary_industry_sector',
barmode='group'
).update_layout(
title_text='N Deals per Year',
xaxis={'title_text':'deal vintage'},
yaxis={'title_text':'# deals'},
legend_title='primary industry sector'
)
temp = deals.merge(companies[['co_id','co_primary_industry_sector']], on='co_id').groupby(['deal_vc_round', 'co_primary_industry_sector'])['co_id'].nunique() \
.rename('n_deals').reset_index()
temp.loc[:,'deal_vc_round_num'] = temp['deal_vc_round'].str.split(r"(th|st|rd|nd) ").str[0].apply(lambda x: 0 if x == 'Angel' else x).astype('int64')
px.histogram(
temp,
x='deal_vc_round_num',
y='n_deals',
color='co_primary_industry_sector',
barmode='group',
histnorm='percent'
).update_layout(
title_text='% Deals per VC Round<br><sup>0 is an angel round',
xaxis={'title_text':'VC Round'},
yaxis={'title_text':'% deals'},
legend_title='primary industry sector'
)
temp = deals.merge(companies[['co_id','co_primary_industry_sector']], on='co_id')[['deal_vc_round', 'deal_size', 'co_primary_industry_sector']].dropna()
temp.loc[:,'deal_vc_round_num'] = temp['deal_vc_round'].str.split(r"(th|st|rd|nd) ").str[0].apply(lambda x: 0 if x == 'Angel' else x).astype('int64')
px.histogram(
temp.groupby(['deal_vc_round_num','co_primary_industry_sector'])['deal_size'].median().rename('median_deal_size').reset_index(),
x='deal_vc_round_num',
y='median_deal_size',
color='co_primary_industry_sector',
barmode='group',
log_y=True
).update_layout(
title_text='Median Deal Size per VC Round',
xaxis={'title_text':'VC Round'},
yaxis={'title_text':'Median Deal Size'},
height=500
)
"""Here, we are just separating and expanding on employee history column, co_employee_hist, which is formatted as string"""
emp_hist = emp_hist.join(emp_hist['co_employee_hist'].apply(lambda x: x[1:-1].replace('"', '').split(',')).explode().rename('expl_emp_hist')) # remove brackets and extra quotes, split into list then explode
emp_hist['emp_hist_yr'] = emp_hist['expl_emp_hist'].str.split(': ').apply(lambda x: x[0]).astype('int64')
emp_hist['emp_hist_cnt'] = emp_hist['expl_emp_hist'].str.split(': ').apply(lambda x: x[1]).astype('int64')
emp_hist.drop(columns=['expl_emp_hist', 'co_employee_hist'], inplace=True)
emp_hist.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 256152 entries, 0 to 83976 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 co_id 256152 non-null string 1 co_name 256152 non-null string 2 co_ticker 40836 non-null string 3 co_city 255655 non-null string 4 co_state 256118 non-null string 5 co_year_founded 248980 non-null Int64 6 co_business_status 255590 non-null string 7 co_financing_status 256152 non-null string 8 co_ownership_status 256152 non-null string 9 co_primary_industry 256000 non-null string 10 co_primary_industry_group 256000 non-null string 11 co_primary_industry_sector 256000 non-null string 12 co_industries 256152 non-null string 13 co_verticals 256152 non-null string 14 co_keywords 256152 non-null string 15 co_description 256081 non-null string 16 emp_hist_yr 256152 non-null int64 17 emp_hist_cnt 256152 non-null int64 dtypes: Int64(1), int64(2), string(15) memory usage: 37.4 MB
emp_hist.head()
| co_id | co_name | co_ticker | co_city | co_state | co_year_founded | co_business_status | co_financing_status | co_ownership_status | co_primary_industry | co_primary_industry_group | co_primary_industry_sector | co_industries | co_verticals | co_keywords | co_description | emp_hist_yr | emp_hist_cnt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002-34 | Acton Marketing | <NA> | Lincoln | Nebraska | 1982 | Generating Revenue | Corporate Backed or Acquired | Acquired/Merged | Media and Information Services (B2B) | Commercial Services | Business Products and Services (B2B) | {"Media and Information Services (B2B)","Other... | {} | {"crosses selling","directing marketing","fina... | Provider of direct marketing services. The com... | 2014 | 16 |
| 1 | 100003-15 | Premama | <NA> | Providence | Rhode Island | 2011 | Generating Revenue | Venture Capital-Backed | Privately Held (backing) | Other Consumer Non-Durables | Consumer Non-Durables | Consumer Products and Services (B2C) | {"Other Consumer Non-Durables","Other Pharmace... | {E-Commerce,FemTech,HealthTech,"LOHAS & Wellne... | {"fertility care","maternity food","postnatal ... | Online retailer of medical supplements intende... | 2015 | 5 |
| 1 | 100003-15 | Premama | <NA> | Providence | Rhode Island | 2011 | Generating Revenue | Venture Capital-Backed | Privately Held (backing) | Other Consumer Non-Durables | Consumer Non-Durables | Consumer Products and Services (B2C) | {"Other Consumer Non-Durables","Other Pharmace... | {E-Commerce,FemTech,HealthTech,"LOHAS & Wellne... | {"fertility care","maternity food","postnatal ... | Online retailer of medical supplements intende... | 2016 | 3 |
| 1 | 100003-15 | Premama | <NA> | Providence | Rhode Island | 2011 | Generating Revenue | Venture Capital-Backed | Privately Held (backing) | Other Consumer Non-Durables | Consumer Non-Durables | Consumer Products and Services (B2C) | {"Other Consumer Non-Durables","Other Pharmace... | {E-Commerce,FemTech,HealthTech,"LOHAS & Wellne... | {"fertility care","maternity food","postnatal ... | Online retailer of medical supplements intende... | 2017 | 6 |
| 1 | 100003-15 | Premama | <NA> | Providence | Rhode Island | 2011 | Generating Revenue | Venture Capital-Backed | Privately Held (backing) | Other Consumer Non-Durables | Consumer Non-Durables | Consumer Products and Services (B2C) | {"Other Consumer Non-Durables","Other Pharmace... | {E-Commerce,FemTech,HealthTech,"LOHAS & Wellne... | {"fertility care","maternity food","postnatal ... | Online retailer of medical supplements intende... | 2020 | 10 |
emp_hist.describe()
| co_year_founded | emp_hist_yr | emp_hist_cnt | |
|---|---|---|---|
| count | 248980.000000 | 256152.000000 | 2.561520e+05 |
| mean | 1993.118263 | 2014.024314 | 2.236470e+03 |
| std | 31.089360 | 7.359399 | 2.283181e+04 |
| min | 1800.000000 | 1812.000000 | 1.000000e+00 |
| 25% | 1986.000000 | 2010.000000 | 1.300000e+01 |
| 50% | 2004.000000 | 2016.000000 | 5.100000e+01 |
| 75% | 2013.000000 | 2020.000000 | 2.750000e+02 |
| max | 2021.000000 | 2022.000000 | 2.300000e+06 |
px.histogram(
emp_hist.groupby(['emp_hist_yr', 'co_primary_industry_sector'])['co_id'].nunique().rename('n_companies').reset_index(),
x='emp_hist_yr',
y='n_companies',
color='co_primary_industry_sector'
).update_layout(
title_text='N Companies with Employee Records For Year X',
xaxis={'title_text':'Year X'},
yaxis={'title_text':'N Companies'},
legend_title='primary industry sector'
)
px.histogram(
emp_hist.groupby(['emp_hist_yr','co_primary_industry_sector'])['emp_hist_cnt'].mean().rename('mean_n_emp').reset_index(),
x='emp_hist_yr',
y='mean_n_emp',
color='co_primary_industry_sector',
barmode='group'
).update_layout(
title_text='Average N employees per Year',
xaxis={'title_text':'Year'},
yaxis={'title_text':'N Employees'}
)
px.histogram(
emp_hist.groupby(['emp_hist_yr','co_primary_industry_sector'])['emp_hist_cnt'].mean().rename('mean_n_emp').reset_index() \
.query('co_primary_industry_sector != "Consumer Products and Services (B2C)"'),
x='emp_hist_yr',
y='mean_n_emp',
color='co_primary_industry_sector',
barmode='group'
).update_layout(
title_text='Average N employees per Year<br><sup>Not incl. Consumer Products and Services (B2C)',
xaxis={'title_text':'Year'},
yaxis={'title_text':'N Employees'}
)